AMENDMENTS TO THE CLAIMS:



1. (Currently amended) A computer, comprising: 

a processor; 

a memory system; 

a co-processing unit; and 

a plurality of data registers for data exchange with said co-processing unit, 
wherein said computer is controlled to implement a method of increasing efficiency
in executing a matrix operation that uses matrix data in a standard format, said standard 
format comprising one of a column major format and a row major format, said method 
comprising: 

for matrix data stored in said standard format in said memory system, wherein said
matrix data comprises data of any of a complete matrix, a complete submatrix, or a part of a
matrix or submatrix, using said processor to separate said matrix data into blocks
of data, each said block having a size p-by-q; and

rearranging by said processor and placing in said memory system of said
computer, for retrieval in a repetitive manner for executing said matrix operation, said blocks
of data to be contiguous blocks of contiguous data such that said matrix data is represented in
a nonstandard format that permits said matrix data to be moved from said memory
system into a position for performing said matrix operation more quickly than if said matrix
data had been moved as stored in said standard format.
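
For illustration only, the separating and rearranging steps recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `to_block_format`, the flat-list representation, the requirement that p and q evenly divide the matrix dimensions, and the ordering of blocks (block column by block column, each block stored column major) are all hypothetical choices made for the example.

```python
def to_block_format(a, m, n, p, q):
    """Rearrange a column-major m-by-n matrix into contiguous p-by-q blocks.

    `a` is a flat list holding the matrix in column-major order, so element
    (i, j) sits at index i + j * m. The result is a flat list in which each
    p-by-q block occupies p * q consecutive positions.

    Assumptions (illustrative, not taken from the claims): p divides m,
    q divides n, blocks are emitted block column by block column, and the
    elements inside a block are kept in column-major order.
    """
    if m % p != 0 or n % q != 0:
        raise ValueError("this sketch requires p to divide m and q to divide n")

    out = []
    for bj in range(0, n, q):              # first column of the current block
        for bi in range(0, m, p):          # first row of the current block
            for j in range(bj, bj + q):    # columns inside the block
                for i in range(bi, bi + p):  # rows inside the block
                    out.append(a[i + j * m])
    return out


# A 4-by-4 matrix in column-major order, split into 2-by-2 blocks.
standard = list(range(16))
blocked = to_block_format(standard, 4, 4, 2, 2)
```

In the blocked list, every group of four consecutive entries is one 2-by-2 block, so a routine that consumes the matrix block by block reads memory with unit stride instead of jumping by the column length.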

Serial No. 10/671,888

Docket No. YOR920030169US1 (YOR.463)

2. (Previously presented) The computer of claim 22, wherein said co-processing unit
comprises a floating point unit (FPU) and said loading said matrix data into said set of data
registers comprises loading said blocks from said storage into a subset of data registers in said
set of data registers, using a deviation from a normal floating point loading instruction of the
floating point unit (FPU) of the computer.

3. (Canceled) 

4. (Previously presented) The computer of claim 1, wherein said size p-by-q comprises a
2-by-2 block.

5. (Previously presented) The computer of claim 2, wherein said deviation from normal 
floating point loading comprises a crisscrossing of elements about a diagonal of said blocks. 

6. (Previously presented) The computer of claim 2, said method further comprising: 

selectively, at least one of loading input data and storing a result of said matrix 
operation into or out of said co-processing unit from L1 cache or memory by at least one of a
subset of optimal load and store instructions, said loading and storing being dictated by an 
optimal FPU loading or storage instruction. 

7. (Previously presented) The computer of claim 2, wherein said deviation of said normal 
floating point loading instruction, in combination with said nonstandard format, provides a 
result data of a transpose of said matrix data to reside in said data registers of said FPU. 

8. (Previously presented) The computer of claim 2, wherein said loading comprises a 2 x 2 
crisscrossing technique. 

9. (Previously presented) The computer of claim 6, wherein said linear algebra operation
comprises one of a BLAS kernel and a factorization kernel.

10-16. (Canceled) 

17. (Currently amended) A signal bearing computer-readable storage medium tangibly 
embodying a program of machine-readable instructions executable by a digital processing 
apparatus to perform a method of storing information of a matrix in a register block data 
format, said method comprising: 

receiving data for a matrix, said data comprising one of a complete matrix data, a 
complete submatrix data, and a partial matrix or submatrix data, said matrix data being stored 
in one of a standard column format and a standard row format; 

dividing said matrix data into blocks, each said block having a size p-by-q; and 

at least one of: 

storing elements in at least one of said blocks in at least one of a cache and a
memory in a format in which elements of said block occupy a location different from an
original location in said block; and

storing, for a repetitive retrieval, said blocks of size p-by-q in a memory in a 
format in which at least one said block occupies a position different from its original position 
in said matrix, 

said register data block format converting the matrix data to no longer be in either of 
said standard column format or said standard row format. 

18. (Currently amended) The signal bearing computer-readable storage medium of claim 17,
said method further comprising: 

repetitively loading said blocks from said memory into a plurality of data registers so 
that a format of data in said data registers comprises a transpose data of said matrix. 

19. (Currently amended) The signal bearing computer-readable storage medium of claim 18, 
wherein said repetitively loading comprises a loading using a 2 x 2 crisscrossing technique. 
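
As a hedged illustration of how a crisscrossed load of a 2 x 2 block could leave transposed data in the registers, the sketch below contrasts a normal pairwise load with a crisscrossed one. The claims do not define the register layout, so everything here is an assumption made for the example: the block is taken to be stored column major as four consecutive values, each register is modeled as a pair of values, and the names `load_normal` and `load_crisscross` are hypothetical.

```python
def load_normal(block):
    """Model a normal load: consecutive elements fill each register pair.

    `block` is a 2-by-2 block stored column major as [a00, a10, a01, a11],
    so the two registers receive the two columns of the block.
    """
    return (block[0], block[1]), (block[2], block[3])


def load_crisscross(block):
    """Model a crisscrossed load (assumed behavior, for illustration only).

    The two off-diagonal elements trade places on the way into the
    registers, so the registers receive the two rows of the block, which
    are the columns of its transpose.
    """
    return (block[0], block[2]), (block[1], block[3])


# Block [[1, 3], [2, 4]] stored column major.
block = [1, 2, 3, 4]
# The same block transposed, [[1, 2], [3, 4]], stored column major.
block_transposed = [1, 3, 2, 4]

normal = load_normal(block)
crossed = load_crisscross(block)
```

Under these assumptions the crisscrossed load of a block yields the same register contents as a normal load of the transposed block, without any separate transposition pass over memory.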

20. (Canceled) 

21. (Currently amended) The computer of claim 1, wherein said matrix operation is executed
on said co-processing unit of said computer and said position for performing said matrix
operation comprises loading said matrix data onto a set of said data registers of said
co-processing unit, said method further comprising:

repetitively retrieving said matrix data from said memory system in said
nonstandard format; and

loading said matrix data into at least a subset of said set of data registers in an optimal 
format, said optimal format comprising a format of said matrix data in said data registers such 
that a minimal possible time is required to utilize said matrix data in said data registers in said 
matrix operation in said co-processing unit. 

22. (Currently amended) The computer of claim 21, wherein said computer includes at least
one of a machine architecture and an instruction set having one or more features that are less
than optimal for executing said matrix operation in said standard format with said
co-processing unit, and said nonstandard format of matrix data and said optimal format in said
data registers together provide a mechanism that overcomes said one or more features that are
less than optimal for executing said matrix operation.

23. (Currently amended) A computer comprising: 

a processor; 

a storage; and 

a co-processing unit, 

said computer configured to implement a method of increasing efficiency in executing 
a matrix operation that uses matrix data in a standard format, said standard format comprising 
one of a column major format and a row major format, said method comprising:

converting, by said processor, at least a part of said matrix data into a pseudo optimal
matrix format comprising contiguous data that no longer represents said matrix data in said
standard format, each pseudo optimal matrix format comprising a subset of said matrix data
that is predetermined to permit a loading of said pseudo matrix data from said storage into
said co-processing unit in an optimal format to perform said matrix operation, said optimal
format comprising a format that allows a minimal possible time in said processing unit to
utilize said matrix data in said matrix operation.

24. (Currently amended) The computer of claim 23, said method further comprising
repetitively loading elements of each said pseudo matrix into said processing unit
for executing said matrix operation, wherein said loading comprises repetitively
placing data of each said pseudo matrix into predetermined registers of a register set of said
co-processing unit in said optimal format.






25. (Currently amended) The computer of claim 24, said method further comprising: 

processing, by said co-processing unit, said matrix operation on said data in said
optimal format, a result of said processing being stored in predetermined registers of said
register set; and

storing said result from said predetermined registers of said register set into
said storage in said pseudo optimal matrix format.

26. (Currently amended) A computer comprising: 

a processor;

a storage;

a co-processing unit; and 

a plurality of data registers for data exchange with said co-processing unit, 

said computer having at least one of a machine architecture and an instruction set 
having one or more features that are less than optimal for executing a matrix operation, said 
computer configured to implement a method of overcoming said disadvantage by software
instructions, said method comprising:

rearranging, by said processor, at least a part of matrix data to be used in said matrix
operation into a plurality of blocks, each block having size p-by-q, such that said matrix data 
is no longer stored in a standard matrix format comprising one of a row major format and a 
column major format, said rearranged matrix data in said blocks being stored in said storage 
as contiguous blocks of contiguous data in a nonstandard format, 

wherein said nonstandard format of said matrix data is predetermined to allow said
matrix data to be placed from said storage into said co-processing unit for processing said
matrix data in said matrix operation such that said disadvantage on said computer is
overcome.

27. (Currently amended) The computer of claim 26, said method further comprising:

repetitively loading said matrix data in said nonstandard format from said storage into 
at least a subset of said data registers of said co-processing unit in an optimal format, said 
optimal format comprising a format allowing a minimal possible time in said processing unit 
to utilize said matrix data in said matrix operation. 

28. (Currently amended) A computer comprising: 

a processor;

a storage;

a co-processing unit; and

a plurality of data registers for data exchange with said co-processing unit, 

said computer configured to implement a method of overcoming a hardware 
disadvantage on said computer relative to a specific processing on a specific computer 
architecture/set of instructions using said co-processing unit, said method comprising:

using first software instructions to preliminarily process input data by said processor 
to be used in said specific processing on said specific computer architecture/set of 
instructions in a manner to generate a first error relative to said specific processing; and 

using second software instructions to subsequently process said input data in a manner 
to generate a correcting error relative to said specific processing, 

wherein first software instructions in combination with said second software 
instructions overcome said disadvantage. 

29. (Currently amended) The computer of claim 28, wherein said specific processing
comprises a matrix operation, said disadvantage comprises a non-optimal loading of matrix
data from said storage into said co-processing unit, and said first error comprises storing
matrix data in said storage in a format that converts matrix data from a standard column
major or row major format into a nonstandard format predetermined to overcome said
disadvantage when said data is subjected to said correcting error, and said correcting error
comprises loading said data in said nonstandard format from said storage into said plurality of
data registers using a non-standard loading format.